Characterization and Evaluation of Hardware Loop Unrolling
Abstract
General-purpose programs contain loops that a compiler cannot optimize. When the body of a loop contains conditional control-flow instructions, or when the loop is controlled by a non-constant induction variable, the compiler cannot unroll the loop. We have found that more than 40-50% of the static loops in the set of programs studied cannot be unrolled by the compiler. To optimize the execution of these loops, we have to detect loop behavior at runtime. We propose that loops that cannot be optimized by the compiler be detected and unrolled by a hardware-based unrolling mechanism. Our design exploits the temporal locality found in loops to provide a higher degree of instruction-level parallelism. With hardware-based unrolling, multiple basic blocks can be retrieved from a dedicated loop cache, which reduces the number of instruction cache and memory requests while providing a large window of instructions for speculative execution. Before designing our hardware mechanism, we characterized all loops that cannot be optimized by the compiler. Using these characteristics, we design a hardware mechanism that unrolls loop iterations dynamically. To drive our prediction mechanism, we correlate the pattern of branch outcomes leading up to a loop with the path of branches executed within the loop body. We capture a history of the sequence of paths followed in a loop to predict the entire loop visit. We can then unroll entire loop bodies without the aid of the compiler. To characterize loop execution and evaluate the effectiveness of the proposed mechanisms, we study three sets of benchmarks: MediaBench, MiBench, and the loop-intensive subset of SPECint2000. Our results show that hardware-based loop unrolling can be performed dynamically and exposes new levels of instruction-level parallelism. We have found that this mechanism consistently increases IPC, with maximum speedups greater than 20%.
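The prediction idea in the abstract (correlating the branch outcomes that precede a loop with the sequence of paths taken inside it) can be illustrated with a small software model. The sketch below is not the authors' hardware design: the class name `LoopVisitPredictor`, the `history_bits` parameter, the path labels, and the dictionary-based table are all hypothetical, and in-loop branches are deliberately kept out of the pre-loop history to keep the model small.

```python
from collections import deque


class LoopVisitPredictor:
    """Toy model (hypothetical): predicts the whole path sequence of a loop
    visit from the branch outcomes seen just before the loop is entered."""

    def __init__(self, history_bits=4):
        # Shift register of the most recent pre-loop branch outcomes (1 = taken).
        self.history = deque([0] * history_bits, maxlen=history_bits)
        # Maps (loop address, pre-loop history) -> last observed path sequence.
        self.table = {}

    def record_branch(self, taken):
        """Shift one branch outcome into the pre-loop history."""
        self.history.append(1 if taken else 0)

    def enter_loop(self, loop_pc):
        """Return the lookup key and the predicted path sequence (or None)."""
        key = (loop_pc, tuple(self.history))
        return key, self.table.get(key)

    def exit_loop(self, key, observed_paths):
        """Train the table with the paths actually followed during the visit."""
        self.table[key] = tuple(observed_paths)


def demo():
    p = LoopVisitPredictor(history_bits=2)

    # Visit 1: outcomes taken, not-taken precede the loop; nothing learned yet.
    p.record_branch(True)
    p.record_branch(False)
    key, first = p.enter_loop(0x400)
    p.exit_loop(key, ["A", "A", "B"])

    # Visit 2: a different pre-loop pattern, so it gets its own table entry.
    p.record_branch(True)
    p.record_branch(True)
    key, second = p.enter_loop(0x400)
    p.exit_loop(key, ["B"])

    # Visit 3: same pre-loop pattern as visit 1, so that visit is predicted.
    p.record_branch(True)
    p.record_branch(False)
    _, third = p.enter_loop(0x400)
    return first, second, third
```

In this model, a hit in the table stands in for the point at which a hardware unroller could fetch the predicted basic blocks for the whole visit; a miss (`None`) corresponds to falling back to normal fetch.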
Similar resources
Quantitative Evaluation of Behavioral Synthesis Approaches for Reconfigurable Devices
State-of-the-art behavioral synthesis tools for reconfigurable architectures barely have high-level transformations in order to achieve highly parallelized implementations. If any, they apply loop unrolling to obtain a higher throughput. In this paper, we use the PARO behavioral synthesis tool which has the unique ability to perform both loop unrolling or loop partitioning. Loop unrolling repli...
Extending Loop Unrolling and Shifting for Reconfigurable Architectures
Loops are an important source of optimization. In this paper, we propose an extension to our work on loop unrolling and loop shifting for reconfigurable architectures. By applying unrolling and shifting to a small loop containing a hardware kernel and some software code, we relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software func...
Inner Loop Optimizations in Mapping Single Threaded Programs to Hardware
In the context of mapping high-level algorithms to hardware, we consider the basic problem of generating an efficient hardware implementation of a single threaded program, in particular, that of an inner loop. We describe a control-flow mechanism which provides dynamic loop-pipelining capability in hardware, so that multiple iterations of an arbitrary inner loop can be made simultaneously activ...
A Simulation Methodology for Software Energy Evaluation
We describe a comprehensive simulation methodology and tool for evaluation of software energy for the pipelined DLX processor. Energy models for each module of DLX are built and the energy is evaluated during run time execution. The input to the simulator are the instructions of the program and the simulator estimates energy of each micro-instruction using the energy models. Our simulator allow...
Memory System Energy: Influence of Hardware-Software Optimizations
Memory system usually consumes a significant amount of energy in many battery-operated devices. In this paper, we provide a quantitative comparison and evaluation of the interaction of two hardware cache optimization mechanisms (block buffering and sub-banking) and three widely used compiler optimization techniques (linear loop transformation, loop tiling, and loop unrolling). Our results show th...